Automatic pavement distress classification facilitates improving the efficiency of pavement maintenance and reducing the cost of labor and resources. A recently influential branch of this task divides the pavement image into patches and addresses the problem from the perspective of multi-instance learning. However, these methods neglect the correlation between patches and suffer from low efficiency in model optimization and inference. Meanwhile, the Swin Transformer is able to address both of these issues with its unique strengths. Built upon the Swin Transformer, we present a vision Transformer named \textbf{P}avement \textbf{I}mage \textbf{C}lassification \textbf{T}ransformer (\textbf{PicT}) for pavement distress classification. In order to better exploit the discriminative information of pavement images at the patch level, a \textit{Patch Labeling Teacher} is proposed to leverage a teacher model to dynamically generate pseudo labels of patches from image labels during each iteration, and guide the model to learn the discriminative features of patches. The broad classification head of the Swin Transformer may dilute the discriminative features of distressed patches in the feature aggregation step due to the small distressed area ratio of pavement images. To overcome this drawback, we present a \textit{Patch Refiner} to cluster patches into different groups and only select the highest distress-risk group to yield a slim head for the final image classification. We evaluate our method on CQU-BPDD. Extensive results show that \textbf{PicT} outperforms the second-best model by a large margin of $+2.4\%$ in P@R on the detection task and $+3.9\%$ in $F_1$ on the recognition task, with 1.8x throughput, while enjoying 7x faster training speed under the same computing resources. Our codes and models have been released on \href{https://github.com/dearcaat/pict}{https://github.com/dearcaat/pict}.
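To make the patch-selection idea concrete, here is a minimal sketch (not the released PicT code) of how a per-patch distress-risk score could keep only the riskiest patches before a slim classification head; the keep ratio, feature dimension, and class count below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SlimPatchHead(nn.Module):
    """Illustrative sketch: score patches, keep only the top-k "distress risk"
    patches, and classify from their pooled features. The 25% keep ratio and
    dimensions are assumptions, not PicT's settings."""

    def __init__(self, dim=768, num_classes=7, keep_ratio=0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)            # per-patch distress-risk score
        self.classifier = nn.Linear(dim, num_classes)
        self.keep_ratio = keep_ratio

    def forward(self, patch_tokens):               # (B, N, dim) patch features
        b, n, d = patch_tokens.shape
        risk = self.scorer(patch_tokens).squeeze(-1)        # (B, N)
        k = max(1, int(n * self.keep_ratio))
        topk = risk.topk(k, dim=1).indices                  # riskiest patch indices
        idx = topk.unsqueeze(-1).expand(-1, -1, d)
        selected = patch_tokens.gather(1, idx)              # (B, k, dim)
        pooled = selected.mean(dim=1)                       # aggregate only risky patches
        return self.classifier(pooled)


if __name__ == "__main__":
    head = SlimPatchHead()
    feats = torch.randn(2, 196, 768)               # e.g. 14x14 patch tokens
    print(head(feats).shape)                       # torch.Size([2, 7])
```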
Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models are not good at capturing local details and lack prior knowledge of human body configurations, which limits their modeling power for skeleton representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of the human body into an MLP model to meet the domain-specific demand, while allowing for both local and global spatial interactions. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Our source code and pretrained models will be made publicly available.
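As a rough illustration only, the sketch below combines a skeleton-graph convolution (local joint interactions) with a token-mixing MLP over all joints (global interactions) in one block; the layer sizes, normalization placement, and the identity adjacency used in the demo are assumptions, not GraphMLP's actual design.

```python
import torch
import torch.nn as nn


class GraphMLPBlock(nn.Module):
    """Illustrative block: a graph convolution over the skeleton adjacency
    models local joint interactions, a token-mixing MLP over all joints
    models global ones, and a channel MLP refines each joint feature.
    Sizes and composition are assumptions, not GraphMLP's design."""

    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_joints, num_joints * 2), nn.GELU(),
            nn.Linear(num_joints * 2, num_joints))
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, adj):                       # x: (B, J, dim), adj: (J, J)
        # local: row-normalized adjacency aggregates neighbouring joints
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        x = x + torch.matmul(adj / deg, self.gcn_weight(self.norm1(x)))
        # global: mix information across the joint (token) dimension
        x = x + self.token_mlp(self.norm2(x).transpose(1, 2)).transpose(1, 2)
        # per-joint channel MLP
        return x + self.channel_mlp(self.norm3(x))


if __name__ == "__main__":
    block = GraphMLPBlock()
    joints = torch.randn(4, 17, 256)          # batch of 17-joint feature sets
    adj = torch.eye(17)                       # placeholder skeleton graph
    print(block(joints, adj).shape)           # torch.Size([4, 17, 256])
```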
Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) generate multiple initial hypothesis representations; (ii) model self-hypothesis communication, merging the multiple hypotheses into a single converged representation and then partitioning it into several diverged hypotheses; (iii) learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at https://github.com/vegetebird/mhformer.
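The toy sketch below is one way to read the three-stage decomposition: generate several hypothesis features, merge and re-diverge them, then let them attend to each other before regressing a single 3D pose. The single-linear "stages", hypothesis count, and dimensions are assumptions and far simpler than MHFormer's Transformer modules.

```python
import torch
import torch.nn as nn


class MultiHypothesisSketch(nn.Module):
    """Highly simplified sketch of the three-stage multi-hypothesis idea:
    (i) generate initial hypothesis features, (ii) merge and re-diverge them
    (self-hypothesis communication), (iii) attend across hypotheses and
    regress one 3D pose. All module choices here are assumptions."""

    def __init__(self, num_joints=17, dim=128, num_hyp=3):
        super().__init__()
        self.num_hyp = num_hyp
        # (i) one projection per initial hypothesis from the 2D joint input
        self.hyp_proj = nn.ModuleList(
            [nn.Linear(num_joints * 2, dim) for _ in range(num_hyp)])
        # (ii) merge-then-diverge: one fused representation back to H hypotheses
        self.merge = nn.Linear(dim * num_hyp, dim)
        self.diverge = nn.Linear(dim, dim * num_hyp)
        # (iii) cross-hypothesis attention, then regression to 3D joints
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.regress = nn.Linear(dim, num_joints * 3)

    def forward(self, pose_2d):                          # (B, J, 2)
        b = pose_2d.size(0)
        flat = pose_2d.flatten(1)                        # (B, J*2)
        hyps = torch.stack([p(flat) for p in self.hyp_proj], dim=1)   # (B, H, dim)
        fused = self.merge(hyps.flatten(1))              # single converged representation
        hyps = self.diverge(fused).view(b, self.num_hyp, -1)          # re-diverged hypotheses
        hyps, _ = self.cross_attn(hyps, hyps, hyps)      # hypotheses exchange information
        pose_3d = self.regress(hyps.mean(dim=1))         # aggregate and regress
        return pose_3d.view(b, -1, 3)


if __name__ == "__main__":
    model = MultiHypothesisSketch()
    print(model(torch.randn(2, 17, 2)).shape)            # torch.Size([2, 17, 3])
```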
We propose a novel deep learning framework named the Iteratively Optimized Patch Label Inference Network (IOPLIN) for automatically detecting various pavement distresses, not limited to specific ones such as cracks and potholes. IOPLIN can be iteratively trained with only image-level labels via the Expectation-Maximization Inspired Patch Label Distillation (EMIPLD) strategy, and accomplishes this task well by inferring the labels of patches from the pavement images. IOPLIN enjoys many desirable properties over state-of-the-art single-branch CNN models such as GoogLeNet and EfficientNet. It is able to handle images in different resolutions and sufficiently utilize image information, particularly for high-resolution images, since IOPLIN extracts visual features from unrevised image patches instead of the resized whole image. Moreover, it can roughly localize the pavement distress without using any prior localization information in the training phase. In order to better evaluate the effectiveness of our method in practice, we construct a large-scale bituminous pavement disease detection dataset named CQU-BPDD, consisting of 60,059 high-resolution pavement images acquired from different areas at different times. Extensive results on this dataset demonstrate the superiority of IOPLIN over state-of-the-art image classification approaches in automatic pavement distress detection. The source codes of IOPLIN are released on \url{https://github.com/DearCaat/ioplin}, and the CQU-BPDD dataset can be accessed at \url{https://dearcaat.github.io/CQU-BPDD/}.
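For intuition, here is a heavily simplified EM-style training loop in the spirit of patch label distillation (not the released EMIPLD implementation): the E-step infers patch pseudo labels from the current model and the image-level label, and the M-step refits the patch classifier on them. The top-4 heuristic, feature dimensions, and dummy data are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy patch classifier over pre-extracted 64-d patch features (assumed setup).
patch_clf = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(patch_clf.parameters(), lr=1e-3)

# Dummy data: 8 "images", each split into 16 patch feature vectors,
# with one binary image-level label per image (1 = distressed).
patches = torch.randn(8, 16, 64)
image_labels = torch.randint(0, 2, (8,)).float()

for em_round in range(3):                                # outer EM-style iterations
    # E-step: infer patch pseudo labels from the current model and image labels.
    with torch.no_grad():
        scores = torch.sigmoid(patch_clf(patches)).squeeze(-1)      # (8, 16)
        pseudo = torch.zeros_like(scores)
        top = scores.topk(4, dim=1).indices                         # assumed: top-4 patches
        pseudo.scatter_(1, top, 1.0)
        pseudo = pseudo * image_labels.unsqueeze(1)      # negative images -> all patches 0

    # M-step: refit the patch classifier on the distilled pseudo labels.
    for _ in range(10):
        opt.zero_grad()
        logits = patch_clf(patches).squeeze(-1)
        loss = F.binary_cross_entropy_with_logits(logits, pseudo)
        loss.backward()
        opt.step()
    print(f"round {em_round}: loss {loss.item():.4f}")
```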
As one of the most important psychic stress reactions, micro-expressions (MEs) are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Thus, recognizing MEs (MER) automatically is becoming increasingly crucial in the field of affective computing, and provides essential technical support in lie detection, psychological analysis and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Although several spontaneous ME datasets have recently been released to alleviate this problem, the amount of available data is still tiny. To solve the problem of ME data hunger, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos induced by 671 participants and annotated by more than 20 annotators over the course of three years. Afterwards, we adopt four classical spatiotemporal feature learning models on DFME to perform MER experiments, in order to objectively verify the validity of the DFME dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER respectively on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that our DFME dataset can facilitate the research of automatic MER, and provide a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
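One common remedy for the class-imbalance issue mentioned above is inverse-frequency weighted sampling; the snippet below is a generic sketch of that idea, not the specific solution adopted for DFME, and the label counts are made up.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy label list standing in for an imbalanced ME dataset
# (many clips of one emotion class, few of another); counts are made up.
labels = torch.tensor([0] * 500 + [1] * 120 + [2] * 30)

class_counts = torch.bincount(labels).float()
class_weights = 1.0 / class_counts                  # rarer class -> larger weight
sample_weights = class_weights[labels]              # one weight per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# DataLoader(dataset, batch_size=..., sampler=sampler) would then draw
# roughly class-balanced mini-batches despite the skewed label distribution.
```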
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors, which can not be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve the state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
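As a hedged illustration of the general idea of text-embedding-driven segmentation (not the Universal Model's actual CLIP-driven architecture), the sketch below scores every voxel feature against fixed class text embeddings to produce per-class masks; the dimensions, the 1x1x1 projection, and the random placeholder embeddings are assumptions.

```python
import torch
import torch.nn as nn


class TextDrivenSegHead(nn.Module):
    """Illustrative sketch of text-embedding-driven segmentation: per-class
    mask logits come from the similarity between projected voxel features and
    fixed class text embeddings (e.g. precomputed with a CLIP-style text
    encoder). The plain dot-product head and all sizes are assumptions."""

    def __init__(self, feat_dim=64, text_dim=512, num_classes=31):
        super().__init__()
        self.proj = nn.Conv3d(feat_dim, text_dim, kernel_size=1)
        # Placeholder for precomputed text embeddings of class prompts
        # (random here; a real system would cache encoder outputs).
        self.register_buffer("text_emb", torch.randn(num_classes, text_dim))

    def forward(self, feat):                          # (B, feat_dim, D, H, W)
        voxel = self.proj(feat)                       # (B, text_dim, D, H, W)
        # similarity between every voxel feature and every class text embedding
        logits = torch.einsum("bcdhw,kc->bkdhw", voxel, self.text_emb)
        return logits                                 # (B, num_classes, D, H, W)


if __name__ == "__main__":
    head = TextDrivenSegHead()
    print(head(torch.randn(1, 64, 8, 32, 32)).shape)  # torch.Size([1, 31, 8, 32, 32])
```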
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
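The sketch below illustrates the factorized attention pattern the abstract describes, self-attention over the spatial tokens of each frame followed by attention over the temporal axis of each spatial position, as an assumption-laden stand-in rather than MSTAT's actual STA module; dimensions and head counts are made up.

```python
import torch
import torch.nn as nn


class SpatialTemporalAggregation(nn.Module):
    """Illustrative factorized attention: attend within each frame (spatial),
    then along time for each spatial position (temporal), instead of one
    joint (T*N)-token attention. Sizes are assumptions, not MSTAT's STA."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # (B, T, N, dim)
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                          # attend within each frame
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(b, t, n, d)
        tmp = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # attend along time per position
        tmp, _ = self.temporal_attn(tmp, tmp, tmp)
        x = x + tmp.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    sta = SpatialTemporalAggregation()
    print(sta(torch.randn(2, 8, 49, 256)).shape)            # torch.Size([2, 8, 49, 256])
```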
Neural models with an encoder-decoder framework provide a feasible solution to Question Generation (QG). However, after analyzing the model vocabulary we find that current models (both RNN-based and pre-training based) have more than 23\% inflected forms. As a result, the encoder will generate separate embeddings for the inflected forms, leading to a waste of training data and parameters. Even worse, in decoding these models are vulnerable to irrelevant noise and they suffer from high computational costs. In this paper, we propose an approach to enhance the performance of QG by fusing word transformation. Firstly, we identify the inflected forms of words from the input of encoder, and replace them with the root words, letting the encoder pay more attention to the repetitive root words. Secondly, we propose to adapt QG as a combination of the following actions in the encode-decoder framework: generating a question word, copying a word from the source sequence or generating a word transformation type. Such extension can greatly decrease the size of predicted words in the decoder as well as noise. We apply our approach to a typical RNN-based model and \textsc{UniLM} to get the improved versions. We conduct extensive experiments on SQuAD and MS MARCO datasets. The experimental results show that the improved versions can significantly outperform the corresponding baselines in terms of BLEU, ROUGE-L and METEOR as well as time cost.
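A minimal sketch of the first step (mapping inflected forms back to root words so the encoder sees one embedding per lexeme) is shown below; the tiny lemma table is a toy stand-in for whatever lemmatization resource the paper actually uses, and the recorded transformations merely hint at how a decoder could later re-inflect its output.

```python
# Toy lemma table: a stand-in for a real lemmatizer, not the paper's resource.
TOY_LEMMAS = {
    "studies": "study", "studied": "study", "studying": "study",
    "children": "child", "wrote": "write", "written": "write",
}


def to_root_words(tokens):
    """Replace known inflected forms with their root word and record each
    transformation so a decoder could later restore the surface form."""
    roots, transforms = [], []
    for tok in tokens:
        root = TOY_LEMMAS.get(tok.lower(), tok)
        roots.append(root)
        transforms.append(None if root == tok else (tok, root))
    return roots, transforms


if __name__ == "__main__":
    sentence = "The children studied physics and wrote essays".split()
    roots, transforms = to_root_words(sentence)
    print(roots)          # ['The', 'child', 'study', 'physics', 'and', 'write', 'essays']
    print([t for t in transforms if t])   # [('children', 'child'), ('studied', 'study'), ('wrote', 'write')]
```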
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework Cap4Video, which makes use of captions from three aspects: i) Input data: The video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: We perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: The Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
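At the output level, the idea can be sketched as a weighted fusion of Query-Video and Query-Caption similarities; the 0.8/0.2 weights and random embeddings below are purely illustrative assumptions, not Cap4Video's tuned configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative output-level fusion: the final text-video score is a weighted
# sum of Query-Video and Query-Caption cosine similarities.
num_queries, num_videos, dim = 5, 10, 512
query_emb = F.normalize(torch.randn(num_queries, dim), dim=-1)
video_emb = F.normalize(torch.randn(num_videos, dim), dim=-1)
caption_emb = F.normalize(torch.randn(num_videos, dim), dim=-1)   # one caption per video

sim_qv = query_emb @ video_emb.t()        # Query-Video matching branch
sim_qc = query_emb @ caption_emb.t()      # complementary Query-Caption branch
score = 0.8 * sim_qv + 0.2 * sim_qc       # fused retrieval score (assumed weights)

ranking = score.argsort(dim=1, descending=True)     # per-query video ranking
print(ranking[0][:3])                               # top-3 videos for the first query
```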